Abstract

Probabilistic analogues of regular and context-free grammars are well- 
known in computational linguistics, and currently the subject of inten- 
sive research. To date, however, no satisfactory probabilistic analogue 
of attribute-value grammars has been proposed: previous attempts have 
failed to define a correct parameter-estimation algorithm. 

In the present paper, I define stochastic attribute-value grammars and
give a correct algorithm for estimating their parameters. The estimation
algorithm is adapted from Della Pietra, Della Pietra, and Lafferty.
To estimate model parameters, it is necessary to compute the
expectations of certain functions under random fields. In the application
discussed by Della Pietra, Della Pietra, and Lafferty (representing
English orthographic constraints), Gibbs sampling can be used to estimate
the needed expectations. The fact that attribute-value grammars generate
constrained languages makes Gibbs sampling inapplicable, but I show
how a variant of Gibbs sampling, the Metropolis-Hastings algorithm, can
be used instead.



1 Introduction 



Stochastic versions of regular grammars and context-free grammars have re- 
ceived a great deal of attention in computational linguistics for the last sev- 
eral years, and basic techniques of stochastic parsing and parameter estimation 
have been known for decades. However, regular and context-free grammars are 
widely deemed linguistically inadequate; standard grammars in computational 
linguistics are attribute-value grammars of some variety. Before the advent of 
statistical methods, regular and context-free grammars were considered too in- 
expressive for serious consideration, and even now the reliance on stochastic 
versions of the less-expressive grammars is often seen as an expedient
necessitated by the lack of an adequate stochastic version of attribute-value grammars.

Attempts have been made to extend stochastic models developed for the
regular and context-free cases to attribute-value grammars, but to date
without success.^1 Brew sketches a probabilistic version of HPSG, but admits
that his way of dealing with re-entrancies in feature structures is problematic.
Eisele attempts to translate stochastic context-free techniques to
constraint-based grammar by assigning probabilities to SLD proof trees. Both Brew and
Eisele propose associating weights with grammar-rule analogues (typed feature
structures in Brew's case; Horn clauses in Eisele's case) and setting weights
proportional to expected rule frequencies. For want of a standard term, I will call
this the Expected Rule Frequency (ERF) method. Both propose using iterative
reestimation of rule-frequency expectations when dealing with incomplete data
(unannotated corpora), along the lines of the EM algorithm.

The attempt is ultimately unsuccessful. The ERF method is provably cor- 
rect for the context-free case, but it fails in the presence of context dependencies, 
as will be discussed below. Both Brew and Eisele recognize that applying the 
ERF method has deficiencies. Eisele in particular identifies an important symp- 
tom that indicates that something has gone amiss: the grammar induced by 
the EM algorithm defines a probability distribution over trees that is not in 
accordance with their frequency in the training corpus. Moreover, Eisele recog- 
nizes that this problem arises only where there are context dependencies. That 
such dependencies lead to problems is not surprising, given the independence 
assumptions underlying Eisele's model, but he is not able to explain why they 
manifest themselves in the way they do, nor what can be done to address the 
problem. 

Now in fact solutions to the context-sensitivity problem have long been 
known, and are the subject of continuing study, in the image processing field 
and in related areas of statistics. The models of interest are known as ran- 
dom fields. Random fields can be seen as a generalization of Markov chains and 
stochastic branching processes. Markov chains can be seen as stochastic versions 
of regular grammars (Hidden Markov Models are in turn stochastic functions 
of Markov chains) and random branching processes are stochastic versions of 
context-free grammars. The evolution of a Markov chain describes a line, in 
which each stochastic choice depends only on the state at the immediately pre- 
ceding time-point. The evolution of a random branching process describes a 
tree in which a finite-state process may spawn multiple child processes at the 
next time-step, but the number of processes and their states depend only on 
the state of the unique parent process at the preceding time-step. In particular, 

1 I confine my discussion here to Brew and Eisele because they aim to describe parametric 
models of probability distributions over the languages of constraint-based grammars, and to 
estimate the parameters of those models. Other authors have assigned weights or preferences 
to constraint-based grammars but not discussed parameter estimation. One approach of the 
latter sort that I find of particular interest is that of Stefan Riezler, who describes a weighted
logic for constraint-based grammars that characterizes the languages of the grammars as fuzzy 
sets. This interpretation avoids the need for normalization that Brew and Eisele face, though 
parameter estimation still remains to be addressed. 
stochastic choices are independent of other choices at the same time-step: each
process evolves independently. If we permit re-entrancies, that is, if we permit 
processes to re-merge, we generally introduce context-sensitivity. In order to 
re-merge, processes must generally be "in synch," which is to say, they cannot 
evolve in complete independence of one another. Random fields are a particular 
class of multi-dimensional random processes, that is, processes corresponding 
to probability distributions over an arbitrary graph. They were originally stud- 
ied by Gibbs, nearly a hundred years ago, as a model for statistical mechanics, 
and the general family of probability distributions involved is still known by his 
name. 

To my knowledge, the first application of random fields to natural language 
was by Mark et al. [3]. The problem of interest was how to combine a stochastic
context-free grammar with n-gram language models. The resulting structures, 
e.g., (1), obviously involve re-entrancies and context-sensitivity. 

(1)    NP        VP
       |        /  \
     there    was   NP
                   /  \
                  no   response

It was clear at that time that a similar approach ought to succeed for general 
attribute-value grammars, but the issue was not pursued. 

Recent work by Della Pietra, Della Pietra, and Lafferty (henceforth,
DDL) also applies random fields to natural language processing. The application 
they consider is the induction of English orthographic constraints — inducing a 
grammar of possible English words. The authors describe an algorithm for 
selecting informative properties of words to construct a random field, and for 
setting the parameters of the field optimally for a given set of properties, to 
model an empirical word distribution. 

The DDL algorithms require the computation of the expectations, under 
random fields, of certain characteristic functions. In general, computing these 
expectations involves summing over all configurations (all possible character 
sequences, in the orthography application), which is not possible when the con- 
figuration space is large. Instead, DDL use Gibbs sampling to estimate the 
needed expectations. 

The orthography application cannot be immediately converted into a means 
of equipping attribute-value grammars with probabilities. Any labelling of a 
finite linear graph^2 with ASCII characters yields a possible (though not
necessarily probable) English word, and this unconstrainedness is essential for the use
of Gibbs sampling. By contrast, the set of dags admitted by an attribute-value

2 To be precise, DDL use closed linear graphs — i.e., polygons. 
grammar G is highly constrained: most of the time, relabelling a dag admitted
by G does not yield a new dag admitted by G. Gibbs sampling is not applicable. 
However, I will show that a variant of Gibbs sampling, the Metropolis-Hastings 
algorithm, can be used. Indeed, we can use a random branching process much 
like Brew's or Eisele's to supply the so-called proposal matrix for the Metropolis- 
Hastings algorithm. 

In this way, we can assign probabilities to the classes of dags admitted by 
attribute-value grammars. We can use these probabilities to disambiguate sen- 
tences (by selecting the most-probable parse), and we can give a parameter- 
estimation algorithm that is correct, in the sense that, if we generate a training 
corpus of size n from a model M, and then estimate parameters from the training
corpus to yield a model-estimate M_n, then M_n converges to M as n \to \infty.
2 Stochastic Context-Free Grammars

Let us begin by examining stochastic context-free grammars and asking why 
the "obvious" generalization to attribute-value grammars fails. A point of ter- 
minology: I will use the term grammar to refer to an unweighted grammar, be 
it a context-free grammar or attribute-value grammar. The combination of a 
grammar and weights (later, also properties) I will refer to as a model. (Occa- 
sionally I will use model to refer to the weights themselves, or the probability 
distribution they define.) 

Throughout we will use the following stochastic context-free grammar for 
illustrative purposes. Let us call the underlying grammar G_1 and the grammar
equipped with weights as shown, M_1.
(2)  1. S -> A A     \beta_1 = 1/2
     2. S -> B       \beta_2 = 1/2
     3. A -> a       \beta_3 = 2/3
     4. A -> b       \beta_4 = 1/3
     5. B -> a a     \beta_5 = 1/2
     6. B -> b b     \beta_6 = 1/2



The probability of a given tree is computed as the product of probabilities of 
rules used in it. For example: 



(3)  [S [A a] [A a]]




Let x be tree (3) and let q_1 be the probability distribution over trees defined by
model M_1. Then:

(4)  q_1(x) = \beta_1 \cdot \beta_3 \cdot \beta_3 = 1/2 \cdot 2/3 \cdot 2/3 = 2/9

In parsing, we use the probability distribution q_1(x) defined by model M_1 to
disambiguate: the grammar assigns some set of trees {x_1, ..., x_n} to a sentence
\sigma, and we choose the tree x_i that has greatest probability q_1(x_i). For example,
G_1 assigns two parses to the sentence aa: tree (3) above and tree (5):

(5)  [S [B a a]]

The probability of tree (3) is 2/9, as we have seen. The probability of tree (5)
is \beta_2 \beta_5 = 1/2 \cdot 1/2 = 1/4. Since 1/4 > 2/9, a stochastic parser for M_1 should
return tree (5) on input aa.
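
The disambiguation step just described can be sketched in a few lines of Python. This is only a minimal illustration, assuming that a tree is represented as the list of rule numbers used in its derivation; the names `WEIGHTS`, `tree_prob`, `best_parse`, and `PARSES_AA` are hypothetical and not part of the original presentation.

```python
from fractions import Fraction as F
from math import prod

# Rule weights of model M_1, indexed by the rule numbers in (2).
WEIGHTS = {1: F(1, 2), 2: F(1, 2), 3: F(2, 3), 4: F(1, 3), 5: F(1, 2), 6: F(1, 2)}

def tree_prob(rules_used, weights=WEIGHTS):
    """Probability of a tree: the product of the weights of the rules used in it."""
    return prod(weights[r] for r in rules_used)

def best_parse(parses, weights=WEIGHTS):
    """Return the name of the parse with the greatest probability."""
    return max(parses, key=lambda name: tree_prob(parses[name], weights))

# The two parses of the sentence "aa", as lists of rule numbers (an assumed encoding).
PARSES_AA = {
    "tree (3)": [1, 3, 3],  # S -> A A, A -> a, A -> a
    "tree (5)": [2, 5],     # S -> B, B -> a a
}

if __name__ == "__main__":
    for name, rules in PARSES_AA.items():
        print(name, tree_prob(rules))
    print("most probable:", best_parse(PARSES_AA))
```

Exact fractions are used so that the values 2/9 and 1/4 can be compared with the text directly.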

The issue of efficiently computing the most-probable parse for a given
sentence has been thoroughly addressed in the literature. The standard parsing
techniques can be applied as is to the random-field models to be discussed
below, so I simply refer the reader to the literature. Instead, I concentrate on
parameter estimation, which for attribute-value grammars cannot be
accomplished by standard techniques.

By parameter estimation we mean determining values for the weights \beta. In
order for a stochastic grammar to be useful, we must be able to compute the
correct weights, where by correct weights we mean the weights that best account
for a training corpus. The degree to which a given set of weights accounts for
a training corpus is measured by the similarity between the distribution q_\beta(x)
determined by the weights \beta and the distribution of trees x in the training
corpus.



2.1 The Goodness of a Model 

The distribution determined by the training corpus is known as the empirical 
distribution. For example, suppose we have a training corpus containing twelve 
trees of the following four types from L(G_1):

(6)          x_1                x_2              x_3            x_4
       [S [A a] [A a]]    [S [A b] [A b]]    [S [B a a]]    [S [B b b]]
  c =        4                  2                3              3         (total 12)
  p =       4/12               2/12             3/12           3/12

If c_i is the count of how often the i-th tree (type) appears in the corpus, then

    p(x_i) = c_i / \sum_j c_j



In comparing a distribution q to the empirical distribution p, we shall actu- 
ally measure dissimilarity rather than similarity. Our measure for dissimilarity 
of distributions is the Kullback-Leibler distance, defined as: 

(7)  D(p||q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}



The distance between p and q at point x is the log of the ratio of p(x) to
q(x). The overall distance between p and q is the average distance, where the
averaging is over trees (tokens) in the corpus; i.e., point distances \ln(p(x)/q(x))
are weighted by p(x) and summed.

For example, let q_1 be, as before, the distribution determined by model M_1.
The following table shows q_1, p, the ratio q_1(x)/p(x), and the weighted point
distance p(x) \ln(p(x)/q_1(x)). The sum of the fourth column is the Kullback-Leibler
distance D(p||q_1) between p and q_1. The third column contains q_1(x)/p(x)
rather than p(x)/q_1(x) so that one can see at a glance whether q_1(x) is too large
(q_1(x)/p(x) > 1) or too small (< 1).

(8)          q_1      p      q_1/p    p ln(p/q_1)
     x_1     2/9     1/3     0.67     0.14
     x_2     1/18    1/6     0.33     0.18
     x_3     1/4     1/4     1.00     0.00
     x_4     1/4     1/4     1.00     0.00
                                      0.32

The total distance D(p||q_1) = 0.32.
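
The distance computation in table (8) can be reproduced with the short sketch below. It is a minimal illustration, assuming distributions are stored as dictionaries keyed by tree name; `kl_distance`, `P`, and `Q1` are hypothetical names chosen for this example.

```python
from fractions import Fraction as F
from math import log

def kl_distance(p, q):
    """D(p||q) = sum_x p(x) ln(p(x)/q(x)); points with p(x) = 0 contribute nothing."""
    return sum(float(px) * log(float(px / q[x])) for x, px in p.items() if px > 0)

# Empirical distribution of corpus (6) and the distribution q_1 of model M_1.
P = {"x1": F(1, 3), "x2": F(1, 6), "x3": F(1, 4), "x4": F(1, 4)}
Q1 = {"x1": F(2, 9), "x2": F(1, 18), "x3": F(1, 4), "x4": F(1, 4)}

if __name__ == "__main__":
    for x in P:
        point = float(P[x]) * log(float(P[x] / Q1[x]))
        print(x, round(float(Q1[x] / P[x]), 2), round(point, 2))
    print("D(p||q_1) =", round(kl_distance(P, Q1), 2))
```

Note that `Q1` sums to less than one over these four trees, which is exactly the "missing mass" that the discussion of the ERF method goes on to explain.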

One set of weights is better than another if its distance from the empirical
distribution is less. For example, let us consider a different set of weights for
grammar G_1. Let M' be G_1 with weights (1/2, 1/2, 1/2, 1/2, 1/2, 1/2), and let
q' be the probability distribution determined by M'. Then the computation of
the Kullback-Leibler distance is as follows:

(9)          q'       p      q'/p     p ln(p/q')
     x_1     1/8     1/3     0.38     0.33
     x_2     1/8     1/6     0.75     0.05
     x_3     1/4     1/4     1.00     0.00
     x_4     1/4     1/4     1.00     0.00
                                      0.38

The fit for x_2 improves, but that is more than offset by a poorer fit for x_1.
The distribution q_1 is a better distribution than q', in the sense that q_1 is more
similar (less dissimilar) to the empirical distribution than q' is.

This particular measure of goodness of a set of weights has a number of nice 
properties. For one thing, it is not hard to show that the distribution closest to 
the empirical distribution is identically the maximum likelihood distribution. 

Another reason for adopting the definition of goodness in terms of
Kullback-Leibler distance is the following. Suppose Nature secretly chooses some set of
weights M for G_1. These are the true weights; they define the true distribution q.
Nature then generates trees at random from M in accordance with q. Let p_n be
the empirical distribution determined by the first n trees that Nature generates.
A parameter-setting method must choose a model (a set of weights) M_n given
p_n, for each n. A parameter-setting method is correct if it converges to M,
the true model. The sequence of hypotheses M_1, M_2, ... defining distributions
q_1, q_2, ... is said to converge to M (defining distribution q) just in case, for
all tolerances \epsilon, there is some point n such that D(q||q_{n'}) < \epsilon for all n' > n.
It can be shown that D(q||p_n) converges to 0; that is, \lim_{n \to \infty} p_n = q. If
a parameter-setting method returns the model M_n that minimizes D(p_n||q_n),
then \lim_{n \to \infty} q_n = \lim_{n \to \infty} p_n, if the limiting distribution for p_n is generable
by any model with underlying grammar G_1. Since q is generable by such a
grammar, and q is the limit distribution for p_n, it follows that q is also the limit
distribution for q_n, and the method is correct.

Note that the model M that minimizes the distance D(q||q) is M itself, and
D(q||q) = 0. This does not mean, however, that D(p_n||q_n) = 0 for the model
minimizing D(p_n||q_n). The empirical distributions p_n converge to q, but do
not necessarily equal q. Intuitively, the relative frequency of any given tree 
converges to its true probability, but need not be precisely its true probability, 
even in very large corpora. 



2.2 The ERF Method 

For stochastic context-free grammars, it can be shown that the Expected Rule 
Frequency (ERF) method mentioned in the introduction always yields the best 
model for a given training corpus. To define the ERF method, we require a 
bit of terminology and notation. With each rule i in a stochastic context-free
grammar is associated a weight \beta_i and a function f_i(x) that returns the number
of times rule i is used in the derivation of tree x. For example, consider tree
(3), repeated here as (10):

(10)  [S [A a] [A a]], with rule 1 (\beta_1) applied at S and rule 3 (\beta_3) applied at each A




Rule 1 is used once and rule 3 is used twice; accordingly f_1(x) = 1, f_3(x) = 2,
and f_i(x) = 0 for i \in {2, 4, 5, 6}.

The expectation of a function over a probability space (for each i, f_i is such
a function) is simply the average value of the function. We use the notation p[f]
to represent the expectation of f under probability distribution p. It is defined
as:

    p[f] = \sum_x p(x) f(x)

The ERF method instructs us to choose the weight for rule i proportional 
to the average frequency of rule i in the corpus. That is: 



    \beta_i \propto p[f_i]

Algorithmically, we compute the expectation of each rule's frequency, and
normalize among rules with the same lefthand side. For example, consider corpus
(6). The expectation of each rule frequency f_i is a sum of terms p(x) f_i(x).
These terms are shown for each tree, in the following table.



                             p      p f_1   p f_2   p f_3   p f_4   p f_5   p f_6
  x_1  [S [A a] [A a]]      1/3     1/3             2/3
  x_2  [S [A b] [A b]]      1/6     1/6                     2/6
  x_3  [S [B a a]]          1/4             1/4                     1/4
  x_4  [S [B b b]]          1/4             1/4                             1/4

  p[f]  =                           1/2     1/2     2/3     1/3     1/4     1/4
  \beta =                           1/2     1/2     2/3     1/3     1/2     1/2



For example, in tree x_1, rule 1 is used once and rule 3 is used twice. The
empirical probability of x_1 is 1/3, so x_1's contribution to p[f_1] is 1/3 \cdot 1, and
its contribution to p[f_3] is 1/3 \cdot 2. The weight \beta_i is obtained from p[f_i] by
normalizing among rules with the same lefthand side. For example, the expected
rule frequencies p[f_1] and p[f_2] of rules with lefthand side S already sum to 1, so
they are adopted without change as \beta_1 and \beta_2. On the other hand, the expected
rule frequencies p[f_5] and p[f_6] for rules with lefthand side B sum to 1/2, not 1,
so they are doubled to yield weights \beta_5 and \beta_6. It should be observed that the
resulting weights are precisely the weights of model M_1.
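
The ERF computation for corpus (6) can be sketched as follows. This is a minimal illustration under an assumed encoding: each tree type is given as a dictionary of rule counts f_i(x) together with its corpus count, and the names `LHS`, `CORPUS`, and `erf_weights` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F

# Lefthand side of each rule in (2), indexed by rule number.
LHS = {1: "S", 2: "S", 3: "A", 4: "A", 5: "B", 6: "B"}

# Corpus (6): (rule counts f_i(x), number of occurrences) for each tree type.
CORPUS = [
    ({1: 1, 3: 2}, 4),  # x_1 = [S [A a] [A a]]
    ({1: 1, 4: 2}, 2),  # x_2 = [S [A b] [A b]]
    ({2: 1, 5: 1}, 3),  # x_3 = [S [B a a]]
    ({2: 1, 6: 1}, 3),  # x_4 = [S [B b b]]
]

def erf_weights(corpus, lhs):
    """Return (expected rule frequencies p[f_i], ERF weights beta_i)."""
    total = sum(count for _, count in corpus)
    expect = defaultdict(F)
    for freqs, count in corpus:
        for rule, f in freqs.items():
            expect[rule] += F(count, total) * f  # p(x) * f_i(x)
    # Normalize among rules with the same lefthand side.
    lhs_total = defaultdict(F)
    for rule, e in expect.items():
        lhs_total[lhs[rule]] += e
    weights = {r: expect[r] / lhs_total[lhs[r]] for r in lhs if r in expect}
    return dict(expect), weights

if __name__ == "__main__":
    expect, weights = erf_weights(CORPUS, LHS)
    for rule in sorted(weights):
        print(rule, "p[f] =", expect[rule], "beta =", weights[rule])
```

Rules that never occur in the corpus are simply left out of the result here; how to treat them is a design choice the text does not address.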

It can be proven that the ERF weights are the best weights for a given 
grammar, in the sense that they define the distribution that is most similar 
to the empirical distribution. That is, if \beta are the ERF weights (for a given
grammar), then D(p||q_\beta) < D(p||q_{\beta'}) for all sets of weights \beta' \neq \beta.

As noted earlier, one might expect the best weights to yield D(p||q) = 0, but
such is not the case. We have just seen, for example, that the best weights for
grammar G_1 yield distribution q_1, yet D(p||q_1) = 0.32 > 0. A close inspection of
the distance calculation (8) reveals that q_1 is sometimes less than p, but never
greater than p. Could we improve the fit by increasing q_1? For that matter,
how can it be that q_1 is never greater than p? As probability distributions, q_1
and p should have the same total mass, namely, 1. Where is the missing mass
for q_1?

The answer is of course that q_1 and p are probability distributions over
L(G), but not all of L(G) appears in the corpus. Two trees are missing, and 
they account for the missing mass. These two trees are: 

(11)  [S [A a] [A b]]      [S [A b] [A a]]



Each of these trees has probability 0 according to p (hence they can be ignored
in the distance calculation), but probability 1/9 according to q_1.

Intuitively, the problem is this. The distribution q_1 assigns too little weight
to trees x_1 and x_2, and too much weight to the trees of (11); call them x_5 and x_6.
Yet exactly the same rules are used in x_5 and x_6 as are used in x_1 and x_2. Hence
there is no way to increase the weight for trees x_1 and x_2, improving their fit
to p, without simultaneously increasing the weight for x_5 and x_6, making their
fit to p worse. The distribution q_1 is the best compromise possible.

To say it another way, our assumption that the corpus was generated by a 
context-free grammar means that any context dependencies in the corpus must 
be accidental, the result of sampling noise. There is indeed a dependency in 
corpus (6): in the trees where there are two A's, the A's always rewrite the
same way. If corpus (6) was generated by a stochastic context-free grammar,
then this dependency is accidental. 

This does not mean that the context-free assumption is wrong. If we generate
twelve trees at random from q_1, it would not be too surprising if we got corpus
(6). To take a more extreme case, if we generate a random corpus of size 1 from q_1, it is quite
impossible for the resulting empirical distribution to match the distribution q_1.
But as the corpus size increases, the fit between p and q_1 becomes ever better.



3 Attribute-Value Grammars



But what if the dependency in corpus (6) is not accidental? What if we wish to
adopt a grammar that imposes the constraint that both A's rewrite the same 
way? We can impose such a constraint by using an attribute-value grammar. 
Consider the following grammar, in which rewrite rules are now represented as 
feature structures. Let us call this grammar G_2:



(12)  1. [S  1: [A  1: #1]  2: [A  1: #1]]
      2. [S  1: [B]]
      3. [A  1: a]
      4. [A  1: b]
      5. [B  1: a]
      6. [B  1: b]

      (#1 marks the re-entrant value shared by the two A daughters in rule 1.)



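The language L(G_2) is a set of dags, namely:
(13)  x_1, x_2, x_3, x_4  [figure: the four dags of L(G_2). In x_1 and x_2, S dominates two A nodes that share a single daughter (a and b, respectively); in x_3 and x_4, S dominates a B node with daughter a and b, respectively.]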
(The edges of the dags should actually be labelled with 1's and 2's, but I have
suppressed the edge labels for the sake of perspicuity.) 

3.1 AV Grammars and The ERF Method 

Now we face the question of how to attach probabilities to grammar G_2. The
approach followed by Brew and Eisele is basically as follows.^3 Associate a weight
with each of the six "rules" of grammar G_2. For example, let M_2 be the model
consisting of G_2 plus weights (\beta_1, ..., \beta_6) = (1/2, 1/2, 2/3, 1/3, 1/2, 1/2). The
weight assigned to a tree x is then (as before) the product of the weights of the
rules used in x. For example, the weight^4 \tilde{q}_2(x_1) assigned to tree x_1 of (13) is
2/9, computed as follows:

(14)  [figure: dag x_1 annotated with rule weights, \beta_1 at S and \beta_3 at each A]




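Rule 1 is used once and rule 3 is used twice; hence \tilde{q}_2(x_1) = \beta_1 \beta_3 \beta_3 =
1/2 \cdot 2/3 \cdot 2/3 = 2/9.

Observe that \tilde{q}_2(x_1) = \beta_1 \beta_3^2, which is to say, \beta_1^{f_1(x_1)} \beta_3^{f_3(x_1)}. Moreover, since
\beta^0 = 1, it does not hurt to include additional factors \beta_i^{f_i(x_1)} for those i where
f_i(x_1) = 0. That is, we can define \tilde{q}_\beta corresponding to weights \beta = (\beta_1, ..., \beta_n)
generally as:

    \tilde{q}_\beta(x) = \prod_{i=1}^{n} \beta_i^{f_i(x)}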

Now let us consider how to estimate weights. Brew and Eisele propose using 
the ERF method, as in the context-free case. To be sure, Brew and Eisele are 
more concerned about the case in which the training corpus consists of sentences 
alone, rather than parses (dags), and they concentrate on the application of 
the EM algorithm to estimate rule-frequency expectations in the absence of 
complete information. But their basic method is the ERF method: rule weights 
\beta_i are set in accordance with the formula \beta_i \propto p[f_i], under the constraint
that the weights for rules with the same lefthand side sum to 1. The EM 
algorithm enters the picture only as a means of estimating p[fi] when it cannot 
be determined by simple counting. 

3 To be precise, neither Brew nor Eisele adopt the attribute-value framework discussed here,
but the approaches they take in the related frameworks they do adopt are clearly analogous 
to the one I describe here. 

4 The reason for the tilde (~) will be made clear shortly.






To illustrate, let us assume a corpus distribution for the dags (13) analogous
to the distribution in (6):



(15)         x_1     x_2     x_3     x_4
      p =    1/3     1/6     1/4     1/4
Using the ERF method, we estimate rule weights as follows: 



(16)             p      p f_1   p f_2   p f_3   p f_4   p f_5   p f_6
      x_1       1/3     1/3             2/3
      x_2       1/6     1/6                     2/6
      x_3       1/4             1/4                     1/4
      x_4       1/4             1/4                             1/4

      p[f]  =           1/2     1/2     2/3     1/3     1/4     1/4
      \beta =           1/2     1/2     2/3     1/3     1/2     1/2



This table is identical to the one given earlier in the context-free case. We arrive 
at the same weights we considered above for the AV grammar G_2, yielding the
distribution \tilde{q}_2.

3.2 Why the ERF Method Fails 

But at this point a problem arises: \tilde{q}_2 is not a probability distribution. Unlike
in the context-free case, the four trees in (13) constitute the entirety of L(G).
This time, there are no missing trees to account for the missing probability mass.
There is an obvious "fix" for this problem, as Brew and Eisele observe: we can
simply normalize \tilde{q}_2. (This, by the way, is the reason for the tilde in \tilde{q}_2: it is
meant to indicate that \tilde{q}_2 is an "unnormalized" probability distribution.) That
is, for the AV-grammar case, we must define the distribution q_\beta corresponding
to the weights \beta as:



    q_\beta(x) = \frac{1}{Z} \tilde{q}_\beta(x)

where Z is a normalizing constant defined as:

    Z = \sum_{y \in L(G)} \tilde{q}_\beta(y)

In particular, for the ERF weights given in (16), we have Z = 2/9 + 1/18 +
1/4 + 1/4 = 7/9. Dividing \tilde{q}_2 by 7/9 yields the ERF distribution:



(17)              x_1     x_2      x_3      x_4
      q_2(x) =    2/7     1/14     9/28     9/28
On the face of it, then, we can transplant the methods we used in the
context-free case to the AV case and the only problem that arises (\tilde{q}_2 not summing to
1) has an obvious fix (normalization). However, something has actually gone
very wrong. The theorem according to which the ERF method yields the best
weights makes certain assumptions that we inadvertently violated by changing
L(G) and re-apportioning probability via normalization. In point of fact, we can
easily see that the ERF weights (16) are not the best weights for our example
grammar. Consider the alternative model M* given in (18), defining probability
distribution q*:

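(18)  \beta_1 = (3 + 2\sqrt{2}) / (6 + 2\sqrt{2})     (rule 1: S over A A)
      \beta_2 = 3 / (6 + 2\sqrt{2})                  (rule 2: S over B)
      \beta_3 = \sqrt{2} / (1 + \sqrt{2})            (rule 3: A over a)
      \beta_4 = 1 / (1 + \sqrt{2})                   (rule 4: A over b)
      \beta_5 = 1/2                                  (rule 5: B over a)
      \beta_6 = 1/2                                  (rule 6: B over b)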

These weights are proper, in the sense that weights for rules with the same
lefthand side sum to one. The reader can verify that \tilde{q}* sums to Z = 3/(3 + \sqrt{2}) and
that q* is:



(19)              x_1     x_2     x_3     x_4
      q*(x) =     1/3     1/6     1/4     1/4



That is, q* = p. Comparing q_2 (the ERF distribution) and q* to p, we observe
that D(p||q_2) = 0.07 but D(p||q*) = 0.
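
The comparison between the ERF weights and the weights of M* can be checked numerically. The sketch below is only an illustration, assuming each dag is encoded by its rule frequencies f_i(x); the names `DAGS`, `normalized`, and `kl` are hypothetical, and floating-point arithmetic is used because of the square roots.

```python
from math import isclose, log, prod, sqrt

R2 = sqrt(2)
# Weights of model M* as given in (18), and the ERF weights of (16).
M_STAR = {1: (3 + 2 * R2) / (6 + 2 * R2), 2: 3 / (6 + 2 * R2),
          3: R2 / (1 + R2), 4: 1 / (1 + R2), 5: 1 / 2, 6: 1 / 2}
ERF = {1: 1 / 2, 2: 1 / 2, 3: 2 / 3, 4: 1 / 3, 5: 1 / 2, 6: 1 / 2}

# Rule frequencies f_i(x) for the four dags of L(G_2) (assumed encoding).
DAGS = {"x1": {1: 1, 3: 2}, "x2": {1: 1, 4: 2}, "x3": {2: 1, 5: 1}, "x4": {2: 1, 6: 1}}
P = {"x1": 1 / 3, "x2": 1 / 6, "x3": 1 / 4, "x4": 1 / 4}

def normalized(weights):
    """Return (Z, q), where q(x) is the product of beta_i^f_i(x) divided by Z."""
    q_tilde = {x: prod(weights[i] ** f for i, f in fr.items()) for x, fr in DAGS.items()}
    z = sum(q_tilde.values())
    return z, {x: v / z for x, v in q_tilde.items()}

def kl(p, q):
    """Kullback-Leibler distance D(p||q)."""
    return sum(p[x] * log(p[x] / q[x]) for x in p)

if __name__ == "__main__":
    z_erf, q2 = normalized(ERF)
    z_star, q_star = normalized(M_STAR)
    print("ERF: Z =", z_erf, "D(p||q_2) =", kl(P, q2))
    print("M*:  Z =", z_star, "D(p||q*) =", kl(P, q_star))
    print("q* matches p:", all(isclose(q_star[x], P[x]) for x in P))
```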

In short, in the AV case, the ERF weights do not yield the best weights. 
This means that the ERF method does not converge to the correct weights as 
the corpus size increases. If there are genuine dependencies in the grammar, 
the ERF method converges systematically to the wrong weights. Fortunately, 
there are methods that do converge to the right weights. These are methods 
that have been developed for random fields. 



4 Random Fields 

A random field defines a probability distribution over a set of labelled graphs 
called configurations. In our case, the configurations are the dags generated 
by the grammar, i.e., \Omega = L(G).^5 The weight assigned to a configuration is the
product of the weights assigned to configuration properties.^6 That is:

5 Those familiar with random fields will recognize that identifying configurations with the 
dags of L(G) is not entirely unproblematic. For one thing, configurations are standardly taken
to be labelings over a fixed graph, not graphs with varying topologies. For another thing, the 
configuration space is standardly taken to be finite, not countably infinite, as L(G) may be. 
These issues will be dealt with in the course of discussion. 

6 The standard term in the random-fields literature is feature; I use the term property to 
avoid confusion with feature in the sense of an attribute plus value. 
    \tilde{q}(x) = \prod_i \beta_i^{f_i(x)}

where \beta_i is the weight for property i and f_i(x) is the frequency of occurrence of
property i in configuration x. The probability of a configuration is proportional
to its weight, and is obtained by normalizing the weight distribution. That is:

    q(x) = \frac{1}{Z} \tilde{q}(x),     Z = \sum_{y \in \Omega} \tilde{q}(y)

If we identify properties of a configuration with the rules used in it, the 
random field model is almost identical to the model we considered in the previous 
section. There are two important differences. First, we no longer require weights 
to sum to one for rules with the same lefthand side. Second, we no longer require 
properties to be identical to the rules of the grammar. We use the grammar to 
define the set of configurations \Omega = L(G), but give ourselves more flexibility in
choosing the properties of dags we would like to use to define the probability 
distribution over L(G). 

Let us consider an example. Let us continue to assume grammar G_2
generating language (13), and let us continue to assume the empirical distribution (15).
But now rather than taking rule applications (local trees) to be properties, let
us adopt the following two properties:

(20)  1. an A node with daughter a        2. a B node
      [figure: the two properties, drawn as subdags]



For purpose of illustration, take property 1 to have weight \beta_1 = \sqrt{2} and property
2 to have weight \beta_2 = 3/2. The functions f_1 and f_2 represent the frequencies
of properties 1 and 2, respectively:



(21) 
In short, we are able to exactly recreate the empirical distribution using fewer
properties than before. Intuitively, we need only use as many properties as are
necessary to distinguish among trees that have different empirical probabilities.
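
The claim that two properties suffice can be checked with a small computation. The sketch below rests on an assumption about the two properties: property 1 is taken to be "an A node with daughter a" (so it occurs twice in x_1) and property 2 "a B node" (occurring once each in x_3 and x_4). The frequency table `FREQS` encodes that reading and is not taken verbatim from the text; `random_field` is likewise a hypothetical name.

```python
from math import isclose, prod, sqrt

# Assumed property frequencies (f_1(x), f_2(x)) for the four dags of (13).
FREQS = {"x1": (2, 0), "x2": (0, 0), "x3": (0, 1), "x4": (0, 1)}
BETAS = (sqrt(2), 3 / 2)                      # weights beta_1 and beta_2
P = {"x1": 1 / 3, "x2": 1 / 6, "x3": 1 / 4, "x4": 1 / 4}   # empirical distribution (15)

def random_field(freqs, betas):
    """Return (q_tilde, Z, q) for the field defined by the weights and frequencies."""
    q_tilde = {x: prod(b ** n for b, n in zip(betas, f)) for x, f in freqs.items()}
    z = sum(q_tilde.values())
    return q_tilde, z, {x: w / z for x, w in q_tilde.items()}

if __name__ == "__main__":
    q_tilde, z, q = random_field(FREQS, BETAS)
    for x in FREQS:
        print(x, "f =", FREQS[x], "q~ =", round(q_tilde[x], 4), "q =", round(q[x], 4))
    print("Z =", round(z, 4))
    print("q matches p:", all(isclose(q[x], P[x]) for x in P))
```

Under this reading the unnormalized weights are 2, 1, 3/2, and 3/2, so Z = 6 and the normalized field coincides with the empirical distribution.
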
This added flexibility is welcome, but it does make parameter estimation
more involved. Now we must not only choose values for weights, we must also 
choose the properties that weights are to be associated with. We would like 
to do both in a way that permits us to find the best model, in the sense of 
the model that minimizes the Kullback-Leibler distance with respect to the 
empirical distribution. Methods for doing both are given in a recent paper by 
Della Pietra, Della Pietra, and Lafferty.



5 Field Induction 

In outline, the DDL algorithm is as follows: 

1. Start (t = 0) with the null field (no properties). 

2. Property Selection. Consider every property that might be added to 
the field q_t and choose the best one.

3. Weight Adjustment. Readjust weights for all properties. The result is 
a new field q_{t+1}.

4. Iterate until the field cannot be improved. 
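
The outline above can be made concrete on the four-dag running example. The following is a simplified sketch, not the DDL procedure itself: it enumerates all configurations exactly instead of sampling, it refits all weights by plain gradient ascent for each candidate rather than computing a gain with the other weights held fixed, and the candidate properties in `CANDIDATES` are hypothetical frequency tables chosen for illustration.

```python
from math import exp, log

# Empirical distribution over the four dags of the running example.
P = {"x1": 1 / 3, "x2": 1 / 6, "x3": 1 / 4, "x4": 1 / 4}

# Hypothetical candidate properties, given directly as frequency tables f(x).
CANDIDATES = {
    "A-over-a": {"x1": 2, "x2": 0, "x3": 0, "x4": 0},
    "B":        {"x1": 0, "x2": 0, "x3": 1, "x4": 1},
    "a":        {"x1": 1, "x2": 0, "x3": 1, "x4": 0},
}

def field(props, lam):
    """q(x) proportional to exp(sum_i lam_i f_i(x)); the weights are beta_i = exp(lam_i)."""
    w = {x: exp(sum(l * CANDIDATES[n][x] for n, l in zip(props, lam))) for x in P}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

def kl(p, q):
    return sum(p[x] * log(p[x] / q[x]) for x in p if p[x] > 0)

def adjust_weights(props, steps=5000, rate=0.2):
    """Step 3: gradient ascent on the log-likelihood; gradient_i = p[f_i] - q[f_i]."""
    lam = [0.0] * len(props)
    for _ in range(steps):
        q = field(props, lam)
        grad = [sum((P[x] - q[x]) * CANDIDATES[n][x] for x in P) for n in props]
        lam = [l + rate * g for l, g in zip(lam, grad)]
    return lam

def induce(tol=1e-6):
    props, lam = [], []                      # step 1: the null field
    best = kl(P, field(props, lam))
    while True:                              # step 4: iterate
        trials = []
        for n in CANDIDATES:                 # step 2: try each unused property
            if n not in props:
                l = adjust_weights(props + [n])
                trials.append((kl(P, field(props + [n], l)), n, l))
        if not trials:
            return props, lam, best
        d, n, l = min(trials)
        if best - d < tol:                   # no candidate improves the field
            return props, lam, best
        props, lam, best = props + [n], l, d

if __name__ == "__main__":
    props, lam, dist = induce()
    print("selected:", props)
    print("weights:", [round(exp(l), 4) for l in lam])
    print("final distance:", dist)
```

Exhaustive enumeration is feasible only because the configuration space here has four elements; for realistic grammars the expectations would have to be estimated by sampling.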

One has a great deal of flexibility in defining the space of properties. For 
the sake of concreteness, let us take properties to be labelled subdags. In step 
2 of the algorithm we do not consider every conceivable labelled subdag (there 
are simply too many of them), but only the atomic (i.e., single-node) subdags 
and those complex subdags that can be constructed by combining properties 
already in the field or by combining a property in the field with some atomic 
property. 

In our running example, the atomic properties are: 
(22)  [figure: the atomic (single-node) properties]

Properties can be combined by adding connecting arcs. For example: 




5.1 The Null Field 

Field induction begins with the null field. With the corpus we have been as- 
suming, the null field takes the following form. 
                    x_1     x_2     x_3     x_4
  \tilde{q}(x)       1       1       1       1        Z = 4
  q(x)              1/4     1/4     1/4     1/4



No dag x has any properties, so \tilde{q}(x) = \prod_i \beta_i^{f_i(x)} is a product of zero terms, and
hence has value 1. As a result, q is the uniform distribution. The
Kullback-Leibler distance D(p||q) is 0.03. The aim of property selection is to choose a
property that reduces this distance as much as possible.

The astute reader will note that there is a problem with the null field if 
L(G) is infinite. Namely, it is not possible to have a uniform distribution over 
an infinite set. If each dag in an infinite set of dags is assigned a constant 
nonzero probability $\epsilon$, then the total probability is infinite, no matter
how small $\epsilon$ is. There are a couple of ways of dealing with the problem.
The approach that DDL adopt is to assume a consistent prior distribution p(k)
over graph sizes k, and a family of random fields $q_k$ representing the
conditional probability q(x|k); the probability of a tree is then p(k)q(x|k).
All the random fields have the same
properties and weights, differing only in their normalizing constants. 

I will take a slightly different approach here. Let us adopt an initial distribu- 
tion like that proposed by Brew and Eisele. There is a natural correspondence 
between AV grammars and CFG's, a correspondence that we implicitly adopted 
in earlier discussion. We assume that the rules of an AV grammar are typed 
feature structures in which all types (of toplevel feature structures) are disjoint. 
Types correspond to categories in a CFG, and the righthand side of the CF 
analogue of rule r is the list of types of immediate constituents of r, viewed as 
a feature structure. For example, the AV grammar Gi has corresponding CF 
grammar G\. 

In this framework, a model consists of: (1) An AV grammar G whose purpose
is to define a set of dags L(G). (2) An SCFG H derived from G, with
weights $\theta$, defining a distribution p(d) over derivations d. There is a
unique derivation corresponding to each dag in L(G), but some derivations
correspond to no well-formed dag; intuitively, some derivations lead to
unification failures. Discarding the bad derivations and renormalizing yields
the initial distribution p(x) over dags L(G). (3) A set of properties $f_i$ with
weights $\beta_i$, to define the final distribution
$q(x) = \frac{1}{Z} \prod_i \beta_i^{f_i(x)} p(x)$.

There are a couple of possible choices of weights $\theta$ for the initial
distribution. The easiest approach would be to adopt the ERF weights. Field
induction would then be a way of adding context-sensitivities to the ERF
distribution. An alternative would be to adopt maximum-entropy weights. The
intuitive reason for adopting the uniform distribution (in the finite case) is
that it distinguishes dags in L(G) from dags not in L(G), but otherwise makes no
assumptions about the distribution. The uniform distribution maximizes entropy
over a finite set. Maximizing entropy is more generally applicable, however, and
can be applied to infinite sets as well. Maximum entropy distributions for
context-free languages are discussed in a paper by Miller and O'Sullivan, though
a number of technical questions arise that I do not wish to pursue here.

5.2 Property Selection 

At each iteration, we select a new property $f$ by considering all atomic
properties and all complex properties that can be constructed from properties
already in the field. Holding the weights constant for all old properties in the
field, we choose the best weight $\beta$ for $f$ (how $\beta$ is chosen will be
discussed shortly), yielding a new distribution $q_f = q_{f,\beta}$. The score
for property $f$ is the reduction it permits in $D(p\|q_{\mathrm{old}})$, where
$q_{\mathrm{old}}$ is the old field. That is, the score for $f$ is
$D(p\|q_{\mathrm{old}}) - D(p\|q_f)$. We compute the score for each candidate
property and add to the field that property with the highest score.

To illustrate, consider the two atomic properties 'a' and 'B'. Given the null
field as old field, the best weight for 'a' is $\beta = 7/5$, and the best
weight for 'B' is $\beta = 1$. This yields q and $D(p\|q_f)$ as follows:

(25)
                              x1      x2      x3      x4
  'a'   unnormalized weight   7/5     1       7/5     1       Z = 24/5
        q_f(x)                7/24    5/24    7/24    5/24    D = 0.01

  'B'   unnormalized weight   1       1       1       1       Z = 4
        q_f(x)                1/4     1/4     1/4     1/4     D = 0.03

The better property is 'a', and 'a' would be added to the field if these were the
only two choices.

Intuitively, 'a' is better than 'B' because 'a' permits us to distinguish the
set {x1, x3} from the set {x2, x4}; the empirical probability of the former is
1/3 + 1/4 = 7/12 whereas the empirical probability of the latter is 5/12.
Distinguishing these sets permits us to model the empirical distribution better
(since the old field assigns them equal probability, counter to the empirical
distribution). By contrast, the property 'B' distinguishes the set {x1, x2} from
{x3, x4}. The empirical probability of the former is 1/3 + 1/6 = 1/2 and the
empirical probability of the latter is also 1/2. The old field models these
probabilities exactly correctly, so making the distinction does not permit us to
improve on the old field. As a result, the best weight we can choose for 'B' is
1, which is equivalent to not having the property 'B' at all.
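
The comparison between 'a' and 'B' can be reproduced numerically. This is a sketch under stated assumptions: the property counts (`f_a`, `f_B`) are read off from the sets the two properties distinguish, each property is taken to occur at most once per dag, and natural logarithms are used; the helper names are hypothetical.

```python
from fractions import Fraction
from math import log

p = [Fraction(1, 3), Fraction(1, 6), Fraction(1, 4), Fraction(1, 4)]  # empirical
q_old = [Fraction(1, 4)] * 4                                          # null field

# Assumed property counts: 'a' occurs in x1 and x3, 'B' occurs in x1 and x2.
f_a = [1, 0, 1, 0]
f_B = [1, 1, 0, 0]

def kl(p, q):
    """D(p||q) with natural logarithms."""
    return sum(float(px) * log(float(px / qx)) for px, qx in zip(p, q) if px > 0)

def extend_field(q_old, f, beta):
    """Add property f with weight beta to the old field and renormalize."""
    weights = [qx * beta ** fx for qx, fx in zip(q_old, f)]
    Z = sum(weights)
    return [w / Z for w in weights]

def score(p, q_old, f, beta):
    """Reduction in divergence obtained by adding f with weight beta."""
    return kl(p, q_old) - kl(p, extend_field(q_old, f, beta))

score_a = score(p, q_old, f_a, Fraction(7, 5))
score_B = score(p, q_old, f_B, Fraction(1))
print(f"score('a') = {score_a:.4f}, score('B') = {score_B:.4f}")
```

Under these assumptions 'a' lowers the divergence while 'B', with weight 1, leaves the field unchanged and scores zero.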

5.3 Selecting the Initial Weight

DDL show that there is a unique weight that maximizes the score for a new
property $f$ (provided that the score for $f$ is not constant for all weights),
and that the maximizing weight is the solution to the equation

(26)  $q_{f,\beta}[f] = p[f]$

in the single unknown $\beta$. Intuitively, we choose the weight such that the
expectation of $f$ under the resulting new field is equal to its empirical
expectation.

Solving equation (26) for $\beta$ is easy if L(G) is small enough to enumerate.
Then the sum over L(G) that is implicit in $q_{f,\beta}[f]$ can be expanded out,
and solving for $\beta$ is simply a matter of arithmetic. Things are a bit
trickier if L(G) is too large to enumerate. DDL show that we can solve equation
(26) if we can estimate $q_{\mathrm{old}}[f = k]$ for $k$ from 0 to the maximum
possible value for $f$.
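
To make the last point concrete: since the new field reweights each dag by $\beta^{f(x)}$, the expectation in (26) can be written as $\sum_k k \beta^k q_{\mathrm{old}}[f = k] / \sum_k \beta^k q_{\mathrm{old}}[f = k]$, which increases with $\beta$. The sketch below solves for $\beta$ by bisection; bisection is a choice made here for simplicity, not a method attributed to DDL, and the function names are hypothetical.

```python
def expected_count(hist, beta):
    """E[f] under q_{f,beta}, given hist[k] = q_old[f = k]."""
    num = sum(k * beta ** k * qk for k, qk in hist.items())
    den = sum(beta ** k * qk for k, qk in hist.items())
    return num / den

def solve_weight(hist, target, iterations=200):
    """Find beta > 0 with expected_count(hist, beta) == target by bisection.

    Assumes 0 < target < max(hist), so that a solution exists.
    """
    if not 0 < target < max(hist):
        raise ValueError("target must lie strictly between 0 and the largest count")
    lo, hi = 0.0, 1.0
    while expected_count(hist, hi) < target:   # grow the bracket until it contains beta
        hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if expected_count(hist, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Null field of the running example: the property is present in half the dags.
hist = {0: 0.5, 1: 0.5}
beta_a = solve_weight(hist, 7 / 12)   # empirical expectation of 'a'
beta_B = solve_weight(hist, 1 / 2)    # empirical expectation of 'B'
print(beta_a, beta_B)
```

With the histogram of the null field, this recovers the weights 7/5 and 1 given for 'a' and 'B' in the illustration of property selection.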

We can estimate $q_{\mathrm{old}}[f = k]$ by means of random sampling. The idea
is actually rather simple: to estimate how often the property appears in "the
average dag", we generate a representative mini-corpus from the distribution
$q_{\mathrm{old}}$ and count. That is, we generate dags at random in such a way
that the relative frequency of dag x is $q_{\mathrm{old}}(x)$ (in the limit), and
we count how often the property of interest appears in dags in our generated
mini-corpus.
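
The counting step might look as follows. This is only a sketch: the sampler passed in as `sample_dag` stands for whatever process samples from the old field, and the toy setting (four equiprobable dags, with 'a' present in x1 and x3) is an assumption made for illustration.

```python
import random
from collections import Counter

def estimate_count_distribution(sample_dag, count_property, n_samples, rng):
    """Estimate q_old[f = k] as the relative frequency of count k in a mini-corpus."""
    counts = Counter(count_property(sample_dag(rng)) for _ in range(n_samples))
    return {k: c / n_samples for k, c in sorted(counts.items())}

# Toy setting (assumed): the null field over four dags, and the property 'a',
# taken to occur once in x1 and x3 and not at all in x2 and x4.
DAGS = ["x1", "x2", "x3", "x4"]
COUNT_A = {"x1": 1, "x2": 0, "x3": 1, "x4": 0}

rng = random.Random(0)
hist = estimate_count_distribution(lambda r: r.choice(DAGS), COUNT_A.get, 20000, rng)
print(hist)
```

The resulting histogram is the input needed to solve equation (26) when L(G) cannot be enumerated.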

The application that DDL consider is the induction of English orthographic 
constraints — inducing a field that assigns high probability to "English-sounding" 
words and low probability to non-English-sounding words. For this application, 
Gibbs sampling is appropriate. Gibbs sampling does not work for the application 
to AV grammars, however. Fortunately, there is an alternative random sampling 
method we can use: Metropolis-Hastings sampling. We will discuss the issue in 
some detail shortly. 

5.4 Readjusting Weights 

When a new property is added to the field, the best value for its initial weight 
is chosen, but the weights for the old properties are held constant. In general, 
however, adding the new property may make it necessary to readjust weights 
for all properties. The second half of the DDL algorithm involves finding the 
best set of weights for a given set of properties. 

The method is very similar to the method for selecting the initial weight for
a new property. Let $(\gamma_1, \ldots, \gamma_n)$ be the old weights for the
properties. Consider the equation

(27)  $q_\gamma[\beta_i^{f_\#} f_i] = p[f_i]$

where $f_\#(x) = \sum_i f_i(x)$ is the total number of properties of dag x.
Without going into exactly why $f_i$ is weighted as it is on the lefthand side,
the idea is the same as before: we want to adjust $\beta_i$ so that the average
number of instances of property $f_i$ according to the model matches the average
number of instances of property $f_i$ in dags in the corpus.

If the weights $\gamma_1, \ldots, \gamma_n$ are not already as good as they can
be, solving equation (27) for $\beta_i$ (for each i) is guaranteed to improve the
weights, but it does not necessarily immediately yield the globally best weights.
We can obtain the globally best weights by iterating. Set
$\gamma_i \leftarrow \beta_i$, for all i, and solve equation (27) again. Repeat
until the weights no longer change.

As with equation (26), solving equation (27) is straightforward if L(G) is
small enough to enumerate, but not if L(G) is large. In that case, we must
use random sampling. We generate a representative mini-corpus and estimate
expectations by counting in the mini-corpus.
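
For the enumerable case, the iteration can be sketched as below. One reading is assumed and should be treated as such: the solution of (27) is taken to be a multiplicative correction to the old weight (the usual statement of improved iterative scaling), so the new weight is the old weight times the correction. The equation is solved by bisection, and the two properties and their counts come from the running example; all function names are hypothetical.

```python
def field(gammas, feats, p0):
    """q_gamma(x), proportional to p0(x) * prod_i gamma_i ** f_i(x)."""
    weights = []
    for x, base in enumerate(p0):
        w = base
        for g, f in zip(gammas, feats):
            w *= g ** f[x]
        weights.append(w)
    Z = sum(weights)
    return [w / Z for w in weights]

def solve_correction(q, f_i, f_sharp, target):
    """Bisection for beta in q[beta ** f_sharp * f_i] = target (left side grows with beta)."""
    def lhs(beta):
        return sum(qx * fx * beta ** s for qx, fx, s in zip(q, f_i, f_sharp))
    lo, hi = 0.0, 1.0
    while lhs(hi) < target:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if lhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def readjust_weights(feats, p_emp, p0, rounds=200):
    """Iterate the weight equation until the weights settle (fixed round count here)."""
    f_sharp = [sum(f[x] for f in feats) for x in range(len(p0))]
    targets = [sum(px * fx for px, fx in zip(p_emp, f)) for f in feats]
    gammas = [1.0] * len(feats)
    for _ in range(rounds):
        q = field(gammas, feats, p0)
        betas = [solve_correction(q, f, f_sharp, t) for f, t in zip(feats, targets)]
        # Assumed update: multiply each old weight by its correction factor.
        gammas = [g * b for g, b in zip(gammas, betas)]
    return gammas

p_emp = [1 / 3, 1 / 6, 1 / 4, 1 / 4]   # empirical distribution over x1..x4
p0 = [1 / 4] * 4                       # uniform initial distribution
f_a = [1, 0, 1, 0]                     # assumed counts of property 'a'
f_B = [1, 1, 0, 0]                     # assumed counts of property 'B'
gammas = readjust_weights([f_a, f_B], p_emp, p0)
print(gammas)
```

In this small example the weights settle at the values found one property at a time (7/5 for 'a' and 1 for 'B'), because the two properties do not interact.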

5.5 Random Sampling 

We have seen that random sampling is necessary both to set the initial weight for 
properties under consideration and to adjust all weights after a new property is 
adopted. Random sampling involves creating a corpus that is representative of 
a given model distribution q(x). To take a very simple example, a fair coin can
be seen as a method for sampling from the distribution q such that q(H) = 1/2, 
q(T) = 1/2. Saying that a corpus is representative is actually not a comment 
about the corpus itself but the method by which it was generated: a corpus 
representative of distribution q is one generated by a process that samples from 
q. Saying that a process M samples from q is to say that the empirical distri- 
butions of corpora generated by M converge to q in the limit. For example, if 
we flip a fair coin once, the resulting empirical distribution over (H, T) is either
(1,0) or (0,1), not the fair-coin distribution (1/2,1/2). But as we take larger 
and larger corpora, the resulting empirical distributions converge to (1/2, 1/2). 

One of the advantages of SCFGs, that is lost when we go to random fields, is 
that there is a transparent relationship between an SCFG defining a distribution 
q and a sampler for q. We can sample from the distribution defined by an SCFG 
as follows. Consider the grammar (1), repeated here as (28):

(28)  1. S → A A    $\beta_1 = 1/2$
      2. S → B      $\beta_2 = 1/2$
      3. A → a      $\beta_3 = 2/3$
      4. A → b      $\beta_4 = 1/3$
      5. B → a a    $\beta_5 = 1/2$
      6. B → b b    $\beta_6 = 1/2$

The language of (28) consists of the six trees {x1 = [S [A a] [A a]],
x2 = [S [B a a]], x3 = [S [A b] [A b]], x4 = [S [B b b]], x5 = [S [A a] [A b]],
x6 = [S [A b] [A a]]} with probability distribution q: x1 ↦ 2/9, x2 ↦ 1/4,
x3 ↦ 1/18, x4 ↦ 1/4, x5 ↦ 1/9, x6 ↦ 1/9.

We sample from q via stochastic derivations. In a stochastic derivation, we
start with the start symbol, S. There are two rules expanding S: S → A A and
S → B. We flip a coin to choose between them, heads for A A, tails for B.
Suppose the coin comes up heads. We expand S to A A, and then expand each
of the A's in turn. To expand the first A, we consider the two rules A → a and
A → b. To decide between them, we flip a loaded coin that comes up heads
(A → a) 2/3 of the time and tails (A → b) 1/3 of the time. Suppose this coin
also comes up heads. We rewrite the first A as a and go to the second A. We
flip the loaded coin again; suppose it comes up heads again. We rewrite the
second A as a, and the result is tree x1. The chances of throwing three heads
in this manner are 1/2 · 2/3 · 2/3 = 2/9 = q(x1). If we sample repeatedly in
this manner, the proportion of tree x1 in the resulting corpus will converge to
2/9. This is the sense in which stochastic derivations of this sort sample from
the distribution defined by the given SCFG.
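
The coin-flipping procedure just described can be sketched as follows, using the rule weights given in the walkthrough (1/2 for each S rule, 2/3 and 1/3 for the A rules, and 1/2 for each B rule, the last inferred from the tree probabilities). The bracketed string encoding of trees and the function names are choices made for this illustration.

```python
import random
from fractions import Fraction

# Rule weights of the example SCFG; each nonterminal's weights sum to one.
RULES = {
    "S": [(("A", "A"), Fraction(1, 2)), (("B",), Fraction(1, 2))],
    "A": [(("a",), Fraction(2, 3)), (("b",), Fraction(1, 3))],
    "B": [(("a", "a"), Fraction(1, 2)), (("b", "b"), Fraction(1, 2))],
}

def derive(symbol, rng):
    """Return a bracketed tree produced by one stochastic derivation from symbol."""
    if symbol not in RULES:                      # terminal symbol
        return symbol
    expansions = [rhs for rhs, _ in RULES[symbol]]
    weights = [float(w) for _, w in RULES[symbol]]
    rhs = rng.choices(expansions, weights=weights)[0]
    return "[" + symbol + " " + " ".join(derive(s, rng) for s in rhs) + "]"

def enumerate_trees(symbol):
    """Yield (tree, probability) for every derivation from symbol."""
    if symbol not in RULES:
        yield symbol, Fraction(1)
        return
    for rhs, w in RULES[symbol]:
        for parts, prob in _expand(rhs):
            yield "[" + symbol + " " + " ".join(parts) + "]", w * prob

def _expand(rhs):
    """Yield (subtrees, probability) for every way of expanding a right-hand side."""
    if not rhs:
        yield (), Fraction(1)
        return
    for head, ph in enumerate_trees(rhs[0]):
        for tail, pt in _expand(rhs[1:]):
            yield (head,) + tail, ph * pt

exact = dict(enumerate_trees("S"))
rng = random.Random(0)
corpus = [derive("S", rng) for _ in range(20000)]
x1 = "[S [A a] [A a]]"
print(exact[x1], corpus.count(x1) / len(corpus))
```

The exact enumeration gives the six tree probabilities listed above, and the relative frequency of x1 in the sampled corpus approaches 2/9 as the corpus grows.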

When we went from SCFGs to random fields, we lost the transparent con- 
nection between the probability distribution defined by the field and a method 
for sampling from it. Since weights do not sum to one for rules with the same 
lefthand side (indeed, since the properties with which weights are associated
are not even necessarily rule applications), we cannot sample in the same way
as we sample from an SCFG.

There is, however, a method that can be adapted for sampling from the 
random field defining a probability distribution over the language of an AV 
grammar. This method is the Metropolis-Hastings algorithm. Specifically, in 
the case of sets of dags with probability distribution q, we proceed as follows. 

Recall that we have a grammar G consisting of feature structures. We also
have a context-free analogue H of G with weights $\theta$, which we use to define
the initial distribution p(x). In addition, we have a field consisting of a set
of properties $f_i$ with weights $\beta_i$. The grammar defines a set
$\Omega = L(G)$, and the field plus initial distribution define a probability
distribution $q(x) = \frac{1}{Z} \prod_i \beta_i^{f_i(x)} p(x)$ over $\Omega$.

We can sample from the initial distribution p(x) by performing stochastic
derivations using grammar H. The derivations map to dags in L(G) according
to the correspondence between context-free rules and the AV rules of G. It is
possible that some of the derivations will fail, that is, that they will map to
inconsistent dags. Those derivations are simply discarded. That is, the
probability that H assigns to a derivation is not yet p(x), since H also gives
probability to derivations that fail; when we throw away derivations that map to
inconsistent dags, the result is to restrict the distribution defined by H to
consistent dags and normalize it, so that we end up sampling from p(x).
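
Discarding failed derivations amounts to rejection sampling. The sketch below uses the rule weights of grammar (28) for H; the consistency test is purely hypothetical (it requires the two A daughters to agree) and stands in for whatever unification check the AV grammar G actually imposes.

```python
import random

def sample_derivation(rng):
    """One stochastic derivation under H, using the rule weights of grammar (28)."""
    if rng.random() < 0.5:                                   # S -> A A
        kids = tuple("a" if rng.random() < 2 / 3 else "b" for _ in range(2))
        return ("S", ("A", kids[0]), ("A", kids[1]))
    word = "a" if rng.random() < 0.5 else "b"                # S -> B, then B -> a a | b b
    return ("S", ("B", word, word))

def is_consistent(tree):
    """Hypothetical unification check: the two A daughters must agree."""
    if tree[1][0] == "A":
        return tree[1][1] == tree[2][1]
    return True

def sample_initial(rng):
    """Sample from p(x): repeat derivations until one maps to a consistent dag."""
    while True:
        tree = sample_derivation(rng)
        if is_consistent(tree):
            return tree

rng = random.Random(1)
samples = [sample_initial(rng) for _ in range(20000)]
x1 = ("S", ("A", "a"), ("A", "a"))
print(samples.count(x1) / len(samples))
```

Under this hypothetical constraint, derivations with total probability 2/9 are discarded, so the surviving mass 7/9 is renormalized and x1 receives probability (2/9)/(7/9) = 2/7.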

In this way, we can sample from L(G), but not in accordance with the field
probability q(x). The essence of the Metropolis-Hastings algorithm is a means
of converting the sampler for p(x) into a sampler for q(x). Suppose we are
generating a corpus, and have generated dags $x_1, \ldots, x_n$. Now we wish to
add another dag, $x_{n+1}$, to the corpus. We generate a dag y at random using
the sampler for $p(\cdot)$. Now, instead of simply adding y to the corpus, we
flip a loaded coin that comes up heads with probability

(29)  $A(y|x) = \min\left\{1, \frac{q(y)\,p(x_n)}{q(x_n)\,p(y)}\right\}$

If the coin comes up heads, we do include y in the corpus, that is,
$x_{n+1} = y$. But if the coin comes up tails, we throw y away and make a copy
of $x_n$ instead, that is, $x_{n+1} = x_n$.

The acceptance probability A(y|x) reduces in our case to a particularly simple
form. If $q(y)p(x_n) > q(x_n)p(y)$, then obviously A(y|x) = 1. Otherwise,
writing F(x) for the "field weight" $\prod_i \beta_i^{f_i(x)}$, we have:

  $A(y|x) = \frac{q(y)\,p(x_n)}{q(x_n)\,p(y)} = \frac{Z^{-1}F(y)\,p(y)\,p(x_n)}{Z^{-1}F(x_n)\,p(x_n)\,p(y)} = \frac{F(y)}{F(x_n)}$

It can be shown that the result of generating a new dag with probability
$p(\cdot)$ and accepting it with probability $A(\cdot|x_n)$ yields a sampler for
$q(\cdot)$ (see e.g. Winkler). The final "acceptance" step intuitively serves the
role of "punishing" dags that the p-sampler proposes more often than a q-sampler
would, and shifting their probability to dags that the p-sampler would propose
less often than a q-sampler would.
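
A minimal sketch of the resulting sampler follows. The setting is an assumption chosen to match the running example: four dags, a uniform initial distribution p, and a field containing the single property 'a' with weight 7/5 (taken to occur once in x1 and x3). Because q(x) is proportional to F(x)p(x), the ratio in the acceptance probability reduces to F(y)/F(x), which is all the code needs to evaluate.

```python
import random
from collections import Counter

DAGS = ["x1", "x2", "x3", "x4"]
COUNT_A = {"x1": 1, "x2": 0, "x3": 1, "x4": 0}   # assumed counts of property 'a'
BETA_A = 7 / 5                                   # weight of 'a' in the field

def field_weight(x):
    """F(x): product of property weights raised to their counts in x."""
    return BETA_A ** COUNT_A[x]

def propose(rng):
    """Sampler for the initial distribution p (uniform in this toy setting)."""
    return rng.choice(DAGS)

def metropolis_hastings(n, rng):
    """Generate a corpus of n dags whose empirical distribution approaches q."""
    x = propose(rng)
    corpus = []
    for _ in range(n):
        y = propose(rng)
        accept = min(1.0, field_weight(y) / field_weight(x))
        if rng.random() < accept:
            x = y            # heads: the proposed dag becomes the next corpus item
        corpus.append(x)     # tails: the previous dag is copied instead
    return corpus

corpus = metropolis_hastings(50000, random.Random(0))
freq = Counter(corpus)
print({x: freq[x] / len(corpus) for x in DAGS})
```

In this setting the target probabilities are 7/24 for x1 and x3 and 5/24 for x2 and x4, and the corpus frequencies should approach those values as the corpus grows.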

In somewhat more detail, if we think of the corpus $x_1, x_2, \ldots$ as a random
walk through the space L(G), the Metropolis-Hastings algorithm works because
it forces the random walk to spend time in a region R proportional to the
probability of R. This is accomplished, intuitively, by preservation of what is
known as detailed balance. Detailed balance requires that the probability of
making a transition from dag x to dag y in the course of the random walk
should balance the probability of making a transition from dag y to dag x.

Let q(x) be, as always, the model probability that we wish to sample from
and let q(y|x) be the transition probability: the probability of the next dag in
the corpus being y if the previous dag is x. In our case, q(y|x) (for
$y \neq x$) is the probability that we generate y at random, and then also accept
it: q(y|x) = p(y)A(y|x). Define q(x,y) (for $x \neq y$) to be the joint
probability that x is the previous dag and y is the next dag; that is,
q(x,y) = q(x)q(y|x). Detailed balance requires that q(x,y) = q(y,x). If detailed
balance is preserved, it can be shown that the empirical distribution of the
corpus generated by the random walk converges to $q(\cdot)$, and that the
expectation of a function $f$ taken with respect to the empirical distributions
converges to $q[f]$.

We can see that the transition probability we have assumed does indeed
preserve detailed balance, as follows. Let x be the last-generated tree and y the
new tree, and suppose that q(y)p(x) > q(x)p(y). Then A(y|x) = 1 and
$A(x|y) = \frac{q(x)\,p(y)}{q(y)\,p(x)}$, so:

$$
\begin{aligned}
q(x,y) &= q(x)\,q(y|x) \\
       &= q(x)\,p(y)\,A(y|x) \\
       &= q(x)\,p(y) \\
       &= q(y)\,p(x)\,\frac{q(x)\,p(y)}{q(y)\,p(x)} \\
       &= q(y)\,p(x)\,A(x|y) \\
       &= q(y)\,q(x|y) \\
       &= q(y,x)
\end{aligned}
$$


That is, q(x,y) = q(y,x) and detailed balance is confirmed. The remaining 
cases q(y)p(x) < q(x)p(y) and q(y)p(x) = q(x)p(y) are similar and are left as
an exercise for the reader. 
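
Detailed balance can also be confirmed exactly on a small example, covering all three cases at once. In the sketch below the initial distribution p and the field weights F are illustrative values chosen here, not taken from the text; exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import permutations

# Illustrative (assumed) initial distribution and field weights over four dags.
p = {"x1": Fraction(2, 7), "x2": Fraction(9, 28), "x3": Fraction(1, 14), "x4": Fraction(9, 28)}
F = {"x1": Fraction(7, 5), "x2": Fraction(1), "x3": Fraction(7, 5), "x4": Fraction(1)}

Z = sum(F[x] * p[x] for x in p)
q = {x: F[x] * p[x] / Z for x in p}          # q(x) = F(x) p(x) / Z

def accept(y, x):
    """A(y|x) = min{1, q(y)p(x) / (q(x)p(y))}."""
    return min(Fraction(1), (q[y] * p[x]) / (q[x] * p[y]))

def transition(y, x):
    """q(y|x): propose y from p and accept it; rejected proposals stay at x."""
    if y != x:
        return p[y] * accept(y, x)
    return 1 - sum(p[z] * accept(z, x) for z in p if z != x)

balanced = all(q[x] * transition(y, x) == q[y] * transition(x, y)
               for x, y in permutations(p, 2))
stationary = all(sum(q[x] * transition(y, x) for x in p) == q[y] for y in p)
print(balanced, stationary)
```

The second check shows the consequence of detailed balance that matters for sampling: one step of the random walk leaves q unchanged.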

6 Final Remarks 

In summary, we cannot simply transplant CF methods to the AV grammar case. 
In particular, the ERF method yields correct weights only for SCFGs, not for 
AV grammars. We can define a probabilistic version of AV grammars with a 
correct weight-selection method by going to random fields. Property selection 
and weight adjustment can be accomplished using the DDL algorithms. In 
property selection, we need to use random sampling to find the initial weight 
for a candidate property, and in weight adjustment we need to use random 
sampling to solve the weight equation. The random sampling method that DDL 
used is not appropriate for sets of dags, but we can use the Metropolis-Hastings 
method. 

As a closing note, it should be pointed out explicitly that the random field 
techniques described here can also be profitably applied to context-free gram- 
mars. As Stanley Peters nicely put it, there is a distinction between possibilis- 
tic and probabilistic context-sensitivity. Even if the language described by the 
grammar of interest — that is, the set of possible trees — is context-free, there 
may well be context-sensitive statistical dependencies. Random fields can be 
readily applied to capture such statistical dependencies whether or not L(G) is 
context-sensitive.